Let us start by saying that the session for today is going to contain much less information, but is perhaps more difficult to internalise. It will take you some time to get used to Git. Don’t worry, this is new for most people.
We will start by going through version control and at the end we will talk a bit more about make. In particular we are going to talk about how you can automate an R project with make.
The lecture will be quite hands on and will help you get started with Git. Please note that this is a very basic introduction and does not constitute a deep dive into Git.
First, we need to do a quick check to see if all the relevant software has been installed. Make sure you have:
- R
- RStudio
- Git
- Github account
- make
If you have these things installed you are ready to go!
The notebook for the lecture today follows the slides by Grant McDermott, which can be found here. Some of these notes are directly copied from his slides with Grant’s permission. I recommend you go and check out his notes. He has some really good material.
We are only going to cover the bare essentials of Git today. If you want a really good book to learn more about the interaction between Git, Github and R, I highly recommend the following by Jenny Bryan: https://happygitwithr.com/
Most of you probably use a horribly inefficient naming convention as a makeshift version control system. Does the following look familiar?
- Final Draft.docx
- Final Draft 1.docx
- Final Draft 22 Nov.docx
- Final Draft 22 Nov Comments.doc
- Final Draft 22 Nov Supervisor Feedback.docx
- Final Final Draft.docx
- Final Final Draft 1.docx
If you are guilty of this terrible renaming of the same document then off to the gallows!
Don’t worry, I work with many people who do this. However, while working on my PhD I realised that I couldn’t save a million different versions of the same document with similar names. I was asking myself, which document is the right one? What changes did I make to my document? Why is my PhD folder 15GB in size?!?
If you are working on a project on your own, the easiest form of version control is to use Dropbox, Google Drive, OneDrive, Box, etc. I don’t think it is always required to use git for every project, especially if you are working on your own. These online backup systems have their own built-in ways of providing version control.
I think git really starts to shine when you are working on projects with other people. I remember that things became quite problematic after my PhD when I was trying to work with other people! Collaboration without git was really painful. I started using git in 2014 and I believe it is one of the best investments I have made.
As I mentioned above, the original goal of git was collaboration on big projects. You start with a repository and then everyone gets to work on the repository, where “track changes” are recorded.
In the data science space git is also used to store more than source code. Normally a data science project will contain data, figures, reports and source code.
In this lecture we are going to try and establish one way in which you can introduce git into your normal workflow. Initially it might seem strange, but after a while it will be second nature.
Note: We are working with an R project here. The same should be applicable to other languages like Julia or Python.
In order to get properly set up for a project you are going to have to set up the following,
- Rstudio project
- git repository
All of my projects have this basic structure. There are many details and nuances related to these steps, but these are the basic principles. At the end of the day working with git is going to be no different than saving your project and sending it to Github every now and then.
If you are going to be coding in R, then RStudio offers some really nice integration with Github. The same is true for VS Code and Github, if you are thinking about programming in other languages.
Note: Before you follow the steps above you need to set up the connection between Github and RStudio through a personal access token. Please read the instructions on how to do this here. You could also set up keys for SSH, which is actually the preferred method, but perhaps a bit more complex. Instructions can be found here.
We are going to start by linking an RStudio project to a Github repository. The steps are going to be as follows,
- Create a repo on Github and initialise with a README.
- RStudio
We will do this practically in class so that you can see how it works.
Below is an animated guide (gif) to see how to do Steps 1 and 2.
For the first step, you can just call your repo DataScienceTest. If everyone has the same name for the repo it will make things easier down the line.
For Steps 3, 4 and 5, consider the animated guide below.
If you have done everything properly then you should be able to see a Git tab to the right of the Connections tab in RStudio.
Open the README file that you created when initialising the repo and type something in there. You should see some changes in the Git panel.
There are many graphical user interfaces that you can work with instead of relying on RStudio. I prefer to work with GitKraken, since there are many features that are useful when working on big projects with other people. There is a free version, but the Pro version has some really cool draws.
In my opinion GitHub Desktop is the easiest to get started with for the beginner. You can slowly migrate to other software packages once you understand Git a bit better.
There are four main Git operations: stage, commit, pull and push.
Stage and commit normally occur together. So do pulling and pushing.
Let us stage and commit changes to our README file. Then we can push our local changes to the GitHub repo.
NB: Always pull from the upstream repo before you push any changes. This makes sure the local repo is up to date.
Git at the command line
There is always the option to forgo GUIs entirely and operate everything through the terminal. While GitHub and RStudio are ideal for new users, there is a case to be made for knowing shell commands. There are some things that you can easily do through the shell that are not possible with the RStudio Git GUI.
In addition, you might be working with projects that don’t focus primarily on R. I only use a handful of shell commands in my daily workflow, so I won’t burden you with too many. The easiest command is to clone a repo. I use this a lot.
$ git clone REPOSITORY-URL
You can test this out by cloning the DataScience-871 repo for this course. If you want to do this, you can cd into the directory where you want to save the content of the repo and issue the following command.
$ git clone https://github.com/DawievLill/DataScience-871
Now switch back to your test repo, DataScienceTest, that you created before. You must cd back to the location of this repo on your computer. Let me know if you are struggling with this.
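If you want to practise the clone-then-cd routine without touching GitHub, here is a minimal sketch. It assumes a throwaway local folder as a stand-in for the remote repo, and the name, email and folder names are placeholders that I made up for the example. With a real repo you would simply pass the GitHub URL to git clone instead.

```shell
set -eu

# Scratch directory so nothing touches your real projects.
work=$(mktemp -d)

# Hypothetical stand-in for a GitHub repo: a small local repository.
git init --quiet "$work/DataScienceTest"
cd "$work/DataScienceTest"
git config user.name "Example Student"        # placeholder identity
git config user.email "student@example.com"   # placeholder identity
echo "# DataScienceTest" > README.md
git add README.md
git commit --quiet -m "Add README"

# cd into the folder where the clone should live, then clone into it.
mkdir "$work/projects"
cd "$work/projects"
git clone --quiet "$work/DataScienceTest" my-copy

# Move into the clone and have a look around.
cd my-copy
git log --oneline
git status --short
```

The clone carries the full commit history with it, which is why git log works straight away inside my-copy.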
We can see the commit history with the following command,
$ git log
We can also check which files have changes with the following,
$ git status
We can stage a file, or group of files, as follows,
$ git add NAME-OF-FILE-OR-FOLDER
You can use wildcard characters to stage a group of files. There are a bunch of useful flag options too:
Stage all files (new, modified and deleted) in the whole repo.
$ git add -A
Stage updated files only (modified or deleted, but not new).
$ git add -u
Stage all changes in the current directory and its subfolders. In Git 2.x this includes new, modified and deleted files, so when run from the top of the repo it does the same as -A.
$ git add .
Commit your changes.
$ git commit -m "Helpful message"
Pull from the upstream repository (i.e. GitHub).
$ git pull
Push any local changes that you’ve committed to the upstream repo (i.e. GitHub).
$ git push
Remember to always pull before you push to GitHub.
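To see the whole stage, commit, pull and push cycle in one place, here is a rough sketch that runs entirely offline. The assumptions are mine: a bare repository on the local disk plays the role of GitHub, the identity is a placeholder, and analysis.R is a hypothetical file. It also shows the difference between the -u and -A flags.

```shell
set -eu
work=$(mktemp -d)

# Stand-in for GitHub: a bare repository on the local disk (an assumption
# made so that the example runs without a network connection).
git init --quiet --bare "$work/remote.git"
git clone --quiet "$work/remote.git" "$work/local" 2>/dev/null
cd "$work/local"
git config user.name "Example Student"        # placeholder identity
git config user.email "student@example.com"   # placeholder identity

# First commit: stage everything and publish the current branch.
echo "# DataScienceTest" > README.md
git add -A
git commit --quiet -m "Add README"
git push --quiet -u origin HEAD

# Change a tracked file and create a brand new (untracked) one.
echo "Some notes." >> README.md
echo "x <- 1" > analysis.R

# -u stages the tracked change only; analysis.R is still listed as "??".
git add -u
git status --short

# -A picks up the new file as well.
git add -A
git status --short

git commit --quiet -m "Update README and add analysis script"

# Pull before pushing, as recommended above.
git pull --quiet
git push --quiet
```

Running git status --short between the two add commands is a cheap way to check what is actually staged before you commit.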
Branches are an important feature of Git and you will make use of them when you work on collaborative projects. A branch allows you to take a snapshot of the repo and then try out some new ideas without affecting the main branch. Once you are satisfied with your changes you can try and merge back into the main branch.
You can create a new branch in many ways. You can use RStudio, VS Code, GitKraken, the command line, etc. We will quickly show how to do this with RStudio in the lecture. However, if you wanted to do this with a shell command you could create a new branch on your local machine and switch to it:
$ git checkout -b NAME-OF-YOUR-NEW-BRANCH
Push the new branch to GitHub:
$ git push origin NAME-OF-YOUR-NEW-BRANCH
List all branches on your local machine:
$ git branch
Switch back to (e.g.) the master branch:
$ git checkout master
Delete a branch. The first command deletes it on your local machine and the second deletes it on GitHub. Git will refuse the first command if the branch was never merged, in which case you can force the deletion with -D instead of -d.
$ git branch -d NAME-OF-YOUR-FAILED-BRANCH
$ git push origin :NAME-OF-YOUR-FAILED-BRANCH
Another important topic for collaboration is forking. If you create a fork of a repository you are creating your own copy of the original repository. You can now work on your own version. When you are ready you can submit changes to the original repository through a pull request. One good exercise is to fork the DataScience-871 repository and then look for spelling mistakes in the notes. You can then correct the mistake and submit a pull request. We will talk about how to do this in a second.
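Coming back to the branch commands for a moment, the sketch below walks through the life of a branch from creation to deletion. As before, this is only an illustration under my own assumptions: a local bare repo stands in for GitHub, the identity is a placeholder, and new-idea and idea.txt are made-up names.

```shell
set -eu
work=$(mktemp -d)

# Offline stand-in for GitHub (an assumption for this example).
git init --quiet --bare "$work/remote.git"
git clone --quiet "$work/remote.git" "$work/local" 2>/dev/null
cd "$work/local"
git config user.name "Example Student"        # placeholder identity
git config user.email "student@example.com"   # placeholder identity
echo "# DataScienceTest" > README.md
git add README.md
git commit --quiet -m "Add README"
git push --quiet -u origin HEAD

# Remember the default branch name (master or main, depending on your setup).
default_branch=$(git symbolic-ref --short HEAD)

# Create a new branch, switch to it and commit an experiment.
git checkout --quiet -b new-idea
echo "An experiment." > idea.txt
git add idea.txt
git commit --quiet -m "Try a new idea"

# Push the branch to the remote, then list the local branches.
git push --quiet origin new-idea
git branch

# Switch back and delete the branch, locally and on the remote.
git checkout --quiet "$default_branch"
git branch -D new-idea              # -D because the branch was never merged
git push --quiet origin :new-idea
```

The default branch is looked up rather than hard-coded because some installations call it master and others main.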
You have two options for merging branches / forks:
You can merge locally. Commit changes to a new branch. You can switch between the main and the new branch using the checkout command. Merge the new branch using the merge command.
$ git merge new-idea
You can merge remotely (normally by creating a fork of the original repository). Merging remotely via pull requests is a way to notify collaborators that you have completed some feature. You provide a neat summary of all the changes that you made in your branch. Normally the pull request is then reviewed and can then be approved. Once approved, the pull request will be incorporated / merged on GitHub.
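The local route can be tried out in a scratch repository. This is a minimal sketch, assuming a placeholder identity and hypothetical file names; the branch name new-idea matches the merge command above. The remote route goes through the GitHub website, so it is not shown here.

```shell
set -eu
work=$(mktemp -d)

# Throwaway repository with one commit on the main branch.
git init --quiet "$work/repo"
cd "$work/repo"
git config user.name "Example Student"        # placeholder identity
git config user.email "student@example.com"   # placeholder identity
echo "# DataScienceTest" > README.md
git add README.md
git commit --quiet -m "Add README"
main_branch=$(git symbolic-ref --short HEAD)  # master or main

# Commit a change on a new branch.
git checkout --quiet -b new-idea
echo "A new idea." > idea.txt
git add idea.txt
git commit --quiet -m "Add new idea"

# Switch back to the main branch and merge the new branch into it.
git checkout --quiet "$main_branch"
git merge --quiet new-idea

# The commit from new-idea is now part of the main branch history.
git log --oneline
```

Because the main branch did not change in the meantime, Git can simply move it forward to the tip of new-idea. If both branches had changed the same lines you would be asked to resolve a merge conflict first.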
We are going to try and practice this in class by forking the DataScience-871-Exercise module and then making some changes to a document. For this exercise you will create your own branch with your name attached and then we will see what the pull-request process looks like.
Now we return to make, one of the most important tools for reproducible research.